Abstract

Given an undirected graph and $0 < \epsilon < 1$, a set of nodes is called an $\epsilon$-near clique if all but an $\epsilon$ fraction
of the pairs of nodes in the set have a link between them. In this paper we present a fast synchronous
network algorithm that uses small messages and finds a near-clique. Specifically, we present a constant-time
algorithm that finds, with constant probability of success, a linear-size $\epsilon$-near clique if there exists an $\epsilon^3$-near
clique of linear size in the graph. The algorithm uses messages of $O(\log n)$ bits. The failure probability can
be reduced to $n^{-\Omega(1)}$ in $O(\log n)$ time, and the algorithm also works if the graph contains a clique of size
$\Omega(n/\log^{\alpha}\log n)$ for some $\alpha \in (0, 1)$. Our approach is based on a new idea of adapting property testing
algorithms to the distributed setting.
1 Introduction



Discovering dense subgraphs is an important task both theoretically and practically. From the theoretical point 
of view, clique detection is a fundamental problem in the theory of computational complexity, and for distributed 
algorithms, computing useful constructs of the underlying communication graph is one of the central goals. Let 
us elaborate a little about that. 

Dense graph detection has always been an important problem for clustering and hierarchical decomposition 
of large systems for administrative purposes, for routing and possibly other purposes H. Another reason to 
consider dense subgraphs is conflicts in radio ad-hoc networks [12]. On top of these low-level communication- 
related tasks, dense subgraph detection has recently also attracted considerable interest for Web analysis: as is 
well known, the ranking of results generated by search engines such as Google's PageRank JH is derived from 
the topology of the Web graph; in particular, it can be heavily influenced by "tightly knit communities" fi31 . 
which are essentially dense subgraphs. Hence, to understand the structure of the web, it is important to be able 
to identify such communities. Another dimension where dense subgraphs are interesting for the Web is time: 
it has been observed [14] that evolution of links in blogs is, to some extent, a sequence of significant events, 
where significant events are characterized as dense subgraphs. Thus, considering the web as a dynamic graph, 
identifying large dense subgraphs is useful in understanding its temporal aspect. 

Our Contribution. In this paper we give an efficient randomized distributed algorithm that finds large
dense subgraphs. Obviously, our algorithm does not decide whether there exists a large clique in the graph:
that would be impossible to do efficiently unless P=NP. Instead, our algorithm solves a relaxed problem. First,
we find near-cliques, defined as follows. Given a graph and a constant $\epsilon > 0$, a set of nodes $D$ is said to be
an $\epsilon$-near clique if all, except perhaps an $\epsilon$ fraction, of the pairs of nodes of $D$ have an edge between them (see
Section 2 for more details). For example, using this definition, a clique is a 0-near clique. Second, our algorithm
only identifies a large near-clique, and it is only guaranteed that the density of the output is close to the best
possible. For example, given a graph $G$ and a constant $\epsilon > 0$ such that $G$ contains an $\epsilon$-near clique with a
linear number of nodes, our algorithm finds at least one $\epsilon^{1/3}$-near clique of linear size in $G$. (Our algorithm
can also discover dense subgraphs of sublinear size for smaller values of $\epsilon$.) Our algorithm is extremely frugal:
the output is computed (with constant probability of success) in a constant number of rounds, and all messages
contain $O(\log n)$ bits.¹ Given any $q > 0$, it is possible to amplify the success probability to $1 - q$ in $O(\log(1/q))$
time.

In addition to the direct contribution of the algorithm, we believe that our methodology is interesting in its 
own right. Specifically, our work extends ideas presented in [10] in relation to property testing of the $\rho$-clique
problem (defined below). Even though our construction does not use the property tester of [10] as a black box,
our approach of deriving a distributed algorithm from graph property testers seems to be an interesting idea to 
consider when approaching other problems as well. In a nutshell, property testers do very little overall work 
but have a "random access" probing capability, namely they can probe topologically distant edges; distributed 
algorithms, on the other hand, can do a lot of work (in parallel), but information flow is local, i.e., an algorithm 
which runs for T rounds allows each node to gather information only from distance at most T. However, quite 
a few graph property testers exhibit some locality that can be exploited by distributed algorithms. 

Related work. We are not aware of any previous distributed algorithm that finds large dense subgraphs
efficiently. Maximal independent sets, which are cliques in the complement graph, can be found efficiently
distributively [16, 2]. In this case, there can be no non-trivial guarantee about their size with respect to the
size of the largest (maximum) independent set in the graph. But on the positive side, the sets output by these
algorithms are strictly independent.

¹ If messages may be of unbounded size, the problem becomes both trivial (from the communication viewpoint) and infeasible
(from the computation viewpoint). See Section 3.

Much more is known about dense subgraphs in the centralized setting. The fundamental result is that finding
the largest clique (i.e., fully connected subset of nodes) in a graph, or even approximating its size to within a
factor of $n^{1-\epsilon}$ for any constant $\epsilon > 0$, is computationally hard [13]. There are some closely related results in
the centralized model and in the property testing model. In the centralized model, the Dense $k$-Subgraph (DkS)
problem was studied. In DkS, the input consists of a graph and a positive integer $k$, and the goal is to find
the subset of $k$ nodes with the largest number of edges between them. Feige, Peleg and Kortsarz [7] present a
centralized algorithm approximating DkS within a factor of $O(n^{\delta})$ for a certain $\delta < 1/3$, and it is also possible
to approximate DkS to within roughly n/k (H. Abello, Resende and Sudarsky [1] presented a heuristic for 
finding near-cliques (which they refer to as "Quasi-Cliques") in sparse graphs. 

Property testing was defined by Rubinfeld and Sudan [21] for algebraic properties, and extended by Goldreich,
Goldwasser and Ron [10] to combinatorial graph properties. The relevant concepts are the following.
In the dense graph model, the basic action of a property tester is to query whether a pair of nodes is connected
by an edge in the graph. An $n$-node graph is said to have the $\rho$-clique property if it contains a clique of size
$\rho n$, for some given parameter $0 < \rho < 1$. The $\rho$-clique tester of [10] gets an $n$-node graph $G$ and constants
$\rho, \epsilon$ as input, and decides, using $O(1/\epsilon^6)$ queries and with constant probability of being correct, whether the
input graph has a $\rho$-clique or whether no set of $\rho n$ nodes in $G$ is an $(\epsilon/\rho^2)$-near clique. They further present an
"approximate find" algorithm that, provided that the property tester answers in the affirmative, finds an $\epsilon$-near
clique of size $\rho n$ in the graph in $O(n)$ time. Our algorithm is a new variant of the ideas of [10] and, using a
new analysis, gets a better complexity result in the case of the relaxed assumption of existence of a near-clique.

This relaxation is a special case of tolerant property testing [19], which in our case can be defined as
follows. An $(\epsilon_1, \epsilon_2)$-tolerant $\rho$-clique tester takes parameters $\rho$, $\epsilon_1$ and $\epsilon_2$ where $\epsilon_1 < \epsilon_2$, and decides whether
the graph contains an $\epsilon_1$-near clique or whether no set of $\rho n$ nodes is an $\epsilon_2$-near clique. The general results of
[19] imply that the property tester of [10] is in fact $(\epsilon^6, \epsilon)$-tolerant (our construction is $(\epsilon^3, \epsilon)$-tolerant). Fischer
and Newman [9] prove a general result (for any property testable in $O(1)$ queries), whose implication for our
case is that it is possible to find the smallest $\epsilon$ for which a graph has an $\epsilon$-near clique of size $\rho n$, but the query
complexity is an exponential tower of height $\mathrm{poly}(\epsilon^{-1})$.

A relation between distributed algorithms and property testers was pointed out by Parnas and Ron in [18],
where it is shown for Vertex Cover how to derive a good property tester from a good distributed algorithm
(the reduction goes in the direction opposite to the one we propose in this paper). Recently, techniques from
property testing were used, along with other techniques, by Nguyen and Onak [17] to present constant-time
approximation algorithms for vertex cover and maximum matching in bounded-degree graphs. Their techniques
also yield constant-time distributed algorithms for these problems. Saks and Seshadhri [22] show how to devise
a parallel algorithm that "reconstructs" a noisy monotone function, again using ideas from property testing.

Paper organization. The problem and main results are stated in Section 2. Simple solutions are discussed
in Section 3. The algorithm is presented in Section 4 and analyzed in Section 5. We conclude in Section 6.
Some proofs are presented in an appendix.






2 Definitions, Model, Results 



Graph concepts. In this paper we assume that we are given a simple undirected graph $G = (V, E)$. We denote
$n \stackrel{\mathrm{def}}{=} |V|$. For any given set $U \subseteq V$ of nodes, $\Gamma(U)$ denotes the set of all neighbors of nodes in $U$. Formally,
$\Gamma(U) \stackrel{\mathrm{def}}{=} \{v : u \in U \text{ and } (u, v) \in E\}$.

For counting purposes, we use a slightly unusual approach, and view each undirected edge {u, v} as two 
anti-symmetrical directed edges (u, v) and (v, u). Using this approach, we define the following central concept. 



Definition 1 Let $G = (V, E)$ be a graph. A set of nodes $D \subseteq V$ is called an $\epsilon$-near clique if

$$\left|\{(u,v) : (u,v) \in D \times D \text{ and } \{u,v\} \in E\}\right| \ge (1 - \epsilon) \cdot |D| \cdot (|D| - 1).$$

In such a case we also say that the density of $D$ is at least $1 - \epsilon$.
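As a small illustration of Definition 1, the following sketch computes the density of a node set in a graph stored as an adjacency dictionary. The function names `density` and `is_near_clique`, the dictionary representation, and the convention that sets with fewer than two nodes have density 1 are assumptions made here for illustration. Counting unordered pairs gives the same ratio as counting ordered pairs in the definition.

```python
from itertools import combinations

def density(adj, nodes):
    """Fraction of unordered node pairs in `nodes` that are joined by an edge."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 1.0  # assumed convention: trivial sets count as cliques
    present = sum(1 for u, v in combinations(nodes, 2) if v in adj[u])
    total = len(nodes) * (len(nodes) - 1) // 2
    return present / total

def is_near_clique(adj, nodes, eps):
    """True if `nodes` is an eps-near clique in the sense of Definition 1."""
    return density(adj, nodes) >= 1 - eps

# A 4-cycle 0-1-2-3 with the chord 0-2: five of the six pairs are edges.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(density(adj, [0, 1, 2, 3]))
print(is_near_clique(adj, [0, 1, 2], 0.0))
```

In this example the full node set is a 0.2-near clique but not a 0.1-near clique, while the triangle {0, 1, 2} is a 0-near clique.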

Distributed Algorithms. We use the standard synchronous distributed model CONGEST as defined in [20]. 
Briefly, the system is modeled by an undirected graph, where nodes represent processors and edges represent 
communication links. It is assumed that each node has a unique O(logn) bit identifier. An execution starts 
synchronously and proceeds in rounds: in each round each node sends messages (possibly different messages 
to different neighbors), receives messages, and does some local computation. By the end of the execution, each 
processor writes its output in a local register. A key constraint in the CONGEST model is that the messages 
contain O(logn) bits, which intuitively means that each message can describe a constant number of nodes, 
edges, and polynomially-bounded numbers. The time complexity of the algorithm is the maximal number of 
rounds required to compute all output values. We note that we assume no processor crashes, and therefore any 
synchronous algorithm can be executed in an asynchronous environment using a synchronizer [3].
Problem Statement. In this paper we consider algorithms for finding $\epsilon$-near cliques. The input to the algorithm
is the underlying communication graph and $\epsilon$. Each node has an output register, which holds, when the algorithm
terminates, either a special value "$\bot$" or a label. All nodes with the same output label are in the same
$\epsilon$-near clique, and $\bot$ means that the node is not associated with any near-clique. Note that there may be more
than one near-clique in the output.

Results. The main result of this paper is given below (see Theorem 5.7 for a detailed version).

Theorem 2.1 Let $\epsilon, \delta > 0$. If there exists an $\epsilon^3$-near clique $D \subseteq V$ with $|D| \ge \delta n$, then an $O(\epsilon/\delta)$-near
clique $D'$ with $|D'| = |D| \cdot (1 - O(\epsilon))$ can be found by a distributed algorithm with probability $\Omega(1)$, in
$2^{O(\epsilon^{-4}\delta^{-1}\log(\epsilon^{-1}\delta^{-1}))}$ rounds, using messages of $O(\log n)$ bits.

We stress that the message length is a function of $n$ and is independent of $\epsilon, \delta$.

Let us list a few immediate corollaries to our result. First, for the case where there are near-cliques of linear
size (i.e., $\delta = \Omega(1)$).

Corollary 2.2 Let $\epsilon$ be a constant. If there exists an $\epsilon^3$-near clique $D \subseteq V$ with $|D| = \Omega(n)$, then an $O(\epsilon)$-near
clique $D'$ with $|D'| = |D| \cdot (1 - O(\epsilon))$ can be found by a distributed algorithm with probability $\Omega(1)$, in
$O(1)$ rounds and using messages of $O(\log n)$ bits.

Second, for the case where there are strict cliques of (slightly) sublinear size.

Corollary 2.3 If there exists a clique $D$ with $|D| \ge n/\log^{\alpha}\log n$ for a sufficiently small constant $\alpha > 0$, then
an $o(1)$-near clique $D'$ with $|D'| \ge (1 - o(1)) \cdot |D|$ can be found by a distributed algorithm with probability
$1 - o(1)$, in a polylogarithmic number of rounds and using messages of $O(\log n)$ bits.
3 Simple Approaches



In this section we consider, as a warm-up, two simplistic approaches to solving the near-clique problem, and 
explain why they fail. 

The neighbors' neighbors algorithm. The first idea is to let each node inform all its neighbors about all 
its neighbors. This way, after one communication round, each node knows the topology of the graph to distance 
2, and can therefore find the largest clique it is a member of. It is easy to kill cliques that intersect larger cliques 
(using, say, the smallest ID of a clique as a tie-breaker), and so we can output a set of locally largest cliques in 
a constant number of rounds. Indeed, one can develop a correct algorithm based on these ideas, but there are 
two show-stopper problems in this case. First, the size of a message sent in this algorithm may be very large: 
a message may contain all node IDs. (This is the LOCAL model [20].) Second, the algorithm requires
each node to locally solve the largest clique problem, which is notoriously hard to compute. We thus rule out 
this algorithm on the basis of prohibitive computational and communication complexity. 

The shingles approach. Based on the idea of shingles [6], one may consider the following algorithm. Each 
node picks a random ID (from a space large enough so that the probability of collision is negligible), sends it 
out to all its neighbors, and then selects the smallest ID it knows (among its neighbors and itself) to be its label. 
All nodes with the same label are said to be in the same candidate set. Each candidate set finds its density by 
letting all nodes send their degree in the set to the set leader (the namesake of the set label), and only sets with 
sufficient size and density survive. Conflicts due to overlapping sets are resolved in favor of the larger set, and 
if equal in size, in favor of the smaller label. Call this the "shingles algorithm." 
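The labeling step of the shingles algorithm can be sketched as a centralized simulation. This is only an illustration under assumptions: the function names, the adjacency-dictionary representation, and the use of `random.Random` for the random IDs are choices made here, and the size and density filtering and the conflict resolution described above are omitted.

```python
import random

def shingles_labels(adj, rng):
    """Each node adopts the node holding the smallest random ID in its closed neighborhood."""
    ids = {v: rng.random() for v in adj}  # random IDs; collisions are negligible
    return {v: min(adj[v] | {v}, key=ids.get) for v in adj}

def candidate_sets(labels):
    """Group nodes by label; each group is one candidate set, keyed by its leader."""
    groups = {}
    for v, leader in labels.items():
        groups.setdefault(leader, set()).add(v)
    return groups

# On a complete graph every node sees the global minimum, so there is one candidate set.
k5 = {v: set(range(5)) - {v} for v in range(5)}
print(candidate_sets(shingles_labels(k5, random.Random(1))))
```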

Clearly, if there is a clique of linear size in the graph, then with probability $\Omega(1)$ the globally minimal ID
will be selected by a node in the clique, in which case all nodes in the clique belong to the same candidate 
set. Unfortunately, many other nodes not in the clique may also be included in that candidate set, "diluting" it 
significantly. Formally, we claim the following. 

Claim 1 For any constant $\delta \in (0, 1)$ there exists an infinite family of graphs $\{G_n\}$ such that $G_n$ has $n$ nodes
and contains a clique of size $\delta n$, but for all $\epsilon < \min\left\{\frac{1-\delta}{1+\delta}, 1/9\right\}$ and for sufficiently large $n$, the shingles
algorithm cannot find an $\epsilon$-near clique with at least $(1 - \epsilon)\delta n$ nodes in $G_n$.
Proof: Fix $\delta \in (0, 1)$ and consider, for simplicity, $n$ such that $\delta n$ and $n$ are
even. The graph $G_n$ is defined as follows. The nodes of $G_n$ are partitioned
into four sets denoted $C_1, C_2, I_1, I_2$, where $|C_1| = |C_2| = \delta n/2$ and
$|I_1| = |I_2| = (1 - \delta)n/2$. The sets $C_1, C_2$ are complete subgraphs and $I_1, I_2$
are independent sets (see Figure 1). The pairs of sets $(I_1, C_1)$, $(C_1, C_2)$,
$(C_2, I_2)$ are connected with complete bipartite graphs (i.e., every node in $I_1$
is connected to every node in $C_1$, and similarly for the other pairs). The resulting
graph contains a clique $C = C_1 \cup C_2$ of size $\delta n$.

We proceed by case analysis. Let $v_{\min}$ denote the node with the globally
minimal ID in $G_n$, as drawn by the shingles algorithm.
Case 1: $v_{\min} \in C_1 \cup C_2$. W.l.o.g. assume that $v_{\min} \in C_1$. Then $v_{\min}$'s
candidate set contains exactly $C_1 \cup C_2 \cup I_1$, a set whose density is

$$\frac{\binom{|C_1|+|C_2|}{2} + |C_1| \cdot |I_1|}{\binom{|C_1|+|C_2|+|I_1|}{2}} = \frac{\binom{\delta n}{2} + \delta(1-\delta)n^2/4}{\binom{(1+\delta)n/2}{2}} = \frac{2\delta}{1+\delta},$$

and for $\epsilon < \frac{1-\delta}{1+\delta}$ the density is less than $1 - \epsilon$. Clearly in this case all other candidates are subsets of $I_1 \cup I_2$ and
thus have density 0.

Figure 1: Crosses represent full connectivity.

Case 2: $v_{\min} \in I_1 \cup I_2$. W.l.o.g. assume that $v_{\min} \in I_1$. Then $v_{\min}$'s candidate set is exactly $C_1 \cup \{v_{\min}\}$ and
thus has size $\delta n/2 + 1$, which is asymptotically smaller than $(1 - \epsilon)\delta n$ for any constant $\epsilon < 1/2$.
Finally, consider the other candidate sets in this case. Clearly all nodes in $C_2$ belong to the same candidate set.
Let $A$ denote the set of vertices from $I_1 \cup I_2$ belonging to $C_2$'s candidate set. If $|A| < \delta n/4$ then the candidate
set size is $|C_2| + |A| < 3\delta n/4$, which is less than $(1 - \epsilon)\delta n$ for all $\epsilon < 1/4$. If $|A| \ge \delta n/4$ then the candidate
set density is at most

$$\frac{\binom{|C_2|}{2} + |C_2| \cdot |A|}{\binom{|C_2|+|A|}{2}} \le \frac{8}{3} \cdot \frac{1 - 1/(\delta n)}{3 - 4/(\delta n)},$$

which is asymptotically less than $1 - \epsilon$ for any $\epsilon$ smaller than 1/9. The remaining candidate sets are subsets of
$I_1 \cup I_2$ and thus have density 0. □
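The construction in the proof can be checked numerically. The sketch below builds the four-part graph for concrete sizes and computes the exact density of the Case 1 candidate set $C_1 \cup C_2 \cup I_1$, which should equal $2\delta/(1+\delta)$. The function names and the parameterization by `half_c` $= \delta n/2$ and `half_i` $= (1-\delta)n/2$ are assumptions introduced here.

```python
from fractions import Fraction
from itertools import combinations

def build_claim_graph(half_c, half_i):
    """Graph from the proof of Claim 1: cliques C1, C2 and independent sets I1, I2."""
    C1 = [("C1", j) for j in range(half_c)]
    C2 = [("C2", j) for j in range(half_c)]
    I1 = [("I1", j) for j in range(half_i)]
    I2 = [("I2", j) for j in range(half_i)]
    adj = {v: set() for v in C1 + C2 + I1 + I2}

    def connect(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for u, v in combinations(C1 + C2, 2):  # C1 and C2 together form one clique
        connect(u, v)
    for u in I1:                           # complete bipartite graph between I1 and C1
        for v in C1:
            connect(u, v)
    for u in I2:                           # complete bipartite graph between I2 and C2
        for v in C2:
            connect(u, v)
    return adj, C1, C2, I1, I2

def exact_density(adj, nodes):
    """Exact fraction of node pairs in `nodes` that are edges."""
    nodes = list(nodes)
    present = sum(1 for u, v in combinations(nodes, 2) if v in adj[u])
    return Fraction(present, len(nodes) * (len(nodes) - 1) // 2)

# n = 80 and delta = 1/4, so 2*delta/(1+delta) = 2/5.
adj, C1, C2, I1, I2 = build_claim_graph(10, 30)
print(exact_density(adj, C1 + C2 + I1))
```

With $\delta = 1/4$ the candidate set is only a $3/5$-near clique, although the graph contains a clique on $\delta n = 20$ nodes.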
Summary. The simple approaches demonstrate the basic difficulty of the distributed $\epsilon$-near clique problem:
looking to distance 1 is not sufficient, but looking to distance 2 is too costly. The algorithm presented next finds 
a middle ground using sampling. 



4 Algorithm 

Below we present the algorithm for finding dense subgraphs. Analysis is presented in Section 5.
The basic idea. Let $V' \subseteq V$ be a set of nodes. Define $K(V')$ to be the set of all nodes which are adjacent
to all other nodes in $V'$, i.e., $K(V') \stackrel{\mathrm{def}}{=} \{v : \Gamma(v) \supseteq V' \setminus \{v\}\}$. Further define $T(V')$ to be the set of nodes
in $K(V')$ that are adjacent to all nodes in $K(V')$, i.e., $T(V') \stackrel{\mathrm{def}}{=} \{v \in K(V') : \Gamma(v) \supseteq K(V') \setminus \{v\}\}$. Our
starting point is the following key observation (essentially made in [10]). If $D$ is a clique, then $D \subseteq K(D)$, and
also, by definition, $D \subseteq T(D)$. Furthermore, $T(D)$ is a clique since each $v \in T(D)$ is adjacent to all vertices
in $K(D)$ and in particular those in $T(D)$.

The algorithm finds a set which is roughly $T(D)$, where $D$ is the existing near-clique, by random sampling.
Suppose that we are somehow given a random sample $X$ of $D$. Consider $K(X)$: it is possible that $K(X) \not\subseteq K(D)$,
because $K(X)$ is the set of nodes that are adjacent to all nodes in $X$, but not necessarily to all nodes in
$D$. We therefore relax the definitions of $K(X)$ and $T(X)$ to approximate ones, $K_\epsilon(X)$ and $T_\epsilon(X)$. Finally, we
overcome the inability to sample $D$ directly (because $D$ is unknown) by taking a random sample
$S$ of $V$, trying all its subsets $X \subseteq S$ ($|S|$ is polynomial in $1/\epsilon$), and outputting the maximal $T(X)$ found.
Description and implementation details. We now present the algorithm in detail. We shall use the following
notation. Let $X \subseteq V$ be a set of nodes, and let $0 < \epsilon < 1$. We denote by $K_\epsilon(X)$ the set of nodes which are
neighbors of all but an $\epsilon$-fraction of the nodes in $X$, i.e.,

$$K_\epsilon(X) \stackrel{\mathrm{def}}{=} \{v \in V : |\Gamma(v) \cap X| \ge (1 - \epsilon)|X|\}. \tag{1}$$

Using the notion of $K_\epsilon$, we also define

$$T_\epsilon(X) = K_\epsilon(K_{2\epsilon^2}(X)) \cap K_{2\epsilon^2}(X). \tag{2}$$
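Equations (1) and (2) translate directly into set computations. The following centralized sketch applies the two formulas literally to an adjacency dictionary; the names `k_eps` and `t_eps` are assumptions for illustration. Because the formulas are applied literally, a node of $X$ never counts itself as a neighbor, so on very small sets a small $\epsilon$ can exclude the members of $X$ themselves. The toy example therefore uses the large value $\epsilon = 0.5$.

```python
def k_eps(adj, X, eps):
    """Nodes adjacent to at least a (1 - eps) fraction of X, as in equation (1)."""
    X = set(X)
    return {v for v in adj if len(adj[v] & X) >= (1 - eps) * len(X)}

def t_eps(adj, X, eps):
    """T_eps(X) = K_eps(K_{2 eps^2}(X)) intersected with K_{2 eps^2}(X), as in equation (2)."""
    inner = k_eps(adj, X, 2 * eps ** 2)
    return k_eps(adj, inner, eps) & inner

# A clique on nodes 0..4 plus a pendant node 5 attached to node 0.
adj = {v: set(range(5)) - {v} for v in range(5)}
adj[0].add(5)
adj[5] = {0}

# Sampling four clique nodes recovers the whole clique and leaves out the pendant node.
print(t_eps(adj, {0, 1, 2, 3}, 0.5))
```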
Algorithm DistNearClique

Input: Graph $G = (V, E)$, $\epsilon > 0$, $p \in (0, 1)$.

Output: A label $\mathrm{label}_v \in V \cup \{\bot\}$ at each node $v$, such that $u$ and $v$ are in the same near-clique iff $\mathrm{label}_v = \mathrm{label}_u \ne \bot$.

Sampling stage. Each node joins a set $S$ with probability $p$ (i.i.d.). Let $G[S]$ denote the subgraph of $G$ induced by $S$.

Exploration stage: Finding near-clique "candidates".

(1) Construct a rooted spanning tree for each connected component of $G[S]$. By the end of this step, each node $v \in S$
has a variable $\mathrm{parent}(v)$ that points to one of its neighbors (for the root, $\mathrm{parent}(v) = \mathrm{NULL}$).

(2) Each node in $S$ finds the identity of all nodes in its connected component and stores them in a variable $\mathrm{Comp}(v)$.

(3) Each node $v \in S$ sends $\mathrm{Comp}(v)$ to all its neighbors $\Gamma(v)$. A node $u \in \Gamma(S)$ may receive at this step messages
from several nodes, which may or may not be in different components of $S$. Each node $u \in \Gamma(S)$ sets a parent
pointer $\mathrm{parent}_{S_i}(u)$ for each connected component $S_i$ of $G[S]$ that $u$ is adjacent to (choosing arbitrarily between
its neighbors from the same $S_i$).

(4) Let $u \in \Gamma(S)$. Let $S_1, \ldots, S_\ell$ be the different connected components which are adjacent to $u$. For each $S_i$ where
$1 \le i \le \ell$ the following procedure is executed.

(4a) For all subsets $X \subseteq S_i$, $u$ determines (using the information received in Step 3) whether $u \in K_{2\epsilon^2}(X)$.

(4b) $u$ sends the results of the computations ($2^{|S_i|}$ bits) to all its neighbors, including $\mathrm{parent}_{S_i}(u)$.

(4c) This information is sent up to the root of $S_i$, summing the counts for each $X$ along the way, so that the root
of $S_i$ knows the value of $|K_{2\epsilon^2}(X)|$ for each $X \subseteq S_i$.

(4d) The root sends the value of $|K_{2\epsilon^2}(X)|$ back down to all nodes in $\Gamma(S_i)$.

(4e) Each node $v \in K_{2\epsilon^2}(X)$ sends $|K_{2\epsilon^2}(X)|$ to all its neighbors, for each $X \subseteq S_i$.

(4f) Each node $u \in \Gamma(S_i)$ finds whether $u \in K_\epsilon(K_{2\epsilon^2}(X))$ for each $X \subseteq S_i$, and thus determines whether
$u \in T_\epsilon(X)$ for each $X$.

Decision stage: Conflict resolution.

(1) For each connected component $S_i$, the size of $T_\epsilon(X)$ is computed for each $X \subseteq S_i$ similarly to Steps (4b)-(4c) of the
exploration stage. Let $X(S_i)$ be the subset that maximizes $|T_\epsilon(X)|$ over all $X \subseteq S_i$.

(2) The root of each component $S_i$ sends $|T_\epsilon(X(S_i))|$ out to all nodes in $\Gamma(S_i)$.

(3) After receiving $|T_\epsilon(X(S_i))|$ for all relevant connected components, each node sends an "acknowledge" message
to the component reporting the largest $|T_\epsilon(X(S_i))|$, breaking ties in favor of the largest root ID, and an "abort"
message to all other components.

(4) If no node in $\Gamma(S_i)$ sent an "abort" message to $S_i$, the root sends back the result to all nodes in $T_\epsilon(X(S_i))$ (this is
done by sending $X(S_i)$). The label of a node in $T_\epsilon(X(S_i))$ is the root ID of $S_i$, and $\bot$ otherwise.

The algorithm works in stages as follows. In the sampling stage, a random sample of nodes $S$ is selected;
the exploration stage generates near-clique candidates by considering $T_\epsilon(X)$ for all $X \subseteq S_i$ such that $S_i$
is a connected component of the induced subgraph $G[S]$; and the decision stage resolves conflicts between
intersecting candidates. Pseudo-code for Algorithm DistNearClique is presented above. A detailed explanation
of the distributed implementation of Algorithm DistNearClique follows.

The sampling stage is trivial: each node locally flips a biased coin, so that the node enters S with probability 
p (p is a parameter to be fixed later). This step is completely local, and by its end, each node knows whether it 
is a member of S or not. 

The exploration stage is the heart of our algorithm. To facilitate it, we first construct a spanning tree for
each connected component of $G[S]$ (Step 1 of the exploration stage). This construction is implemented by
constructing a BFS spanning tree of each connected component $S_i$, rooted at the node with the smallest ID in
$S_i$. This is a standard distributed procedure (see, e.g., [20]), but here only the nodes in $S$ take part, and all other
nodes are non-existent for the purpose of this protocol.

In Step 2 of the exploration stage, all nodes send their IDs to the root. Once the root has all IDs, it sends
them back down the tree.

In Step 3 of the exploration stage, each node in $S_i$ sends the identity of all nodes in $S_i$ to all its neighbors.
In addition, we effectively add to each spanning tree all adjacent nodes. This is important so that we avoid
over-counting later. Note that a node of $S$ is a member of a single tree (the tree of its connected component), but
a node in $V \setminus S$ may have more than one parent pointer: it has exactly one pointer for each component it is
adjacent to.

Step 4 of the exploration stage determines for each node its membership in $T_\epsilon(X)$ for each subset $X$ of
each connected component. Consider a node $u \in \Gamma(S_i)$. After Step 3, $u$ knows the IDs of all members of $S_i$, so
it can locally enumerate all $2^{|S_i|}$ subsets $X \subseteq S_i$, and furthermore, $u$ can determine whether $u \in K_{2\epsilon^2}(X)$ for
each such subset $X$. Thus, each such node $u$ locally computes $2^{|S_i|}$ bits: one for each possible subset $X \subseteq S_i$.
We assume that the coordinates of the resulting vector are ordered in a well-known way (say, lexicographically).
These vectors are sent by each node $u \in \Gamma(S_i)$ to all its neighbors, and in particular to its parent in $S_i$. This is
done by $u$ for each $S_i$ it is adjacent to. Step 4c is implemented using standard convergecast on the tree spanning
$S_i$: the vectors are summed coordinate-wise and sent up the tree, so that when the information reaches the root
of $S_i$, it knows the size of $K_{2\epsilon^2}(X)$ for each $X \subseteq S_i$. Finally, using the size of $K_{2\epsilon^2}(X)$, and knowing which
of its neighbors is in $K_{2\epsilon^2}(X)$, each node $u$ can determine whether $u \in K_\epsilon(K_{2\epsilon^2}(X))$, and thus decide whether
it is in $T_\epsilon(X)$ for each of the possible subsets $X$.
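The bit vectors and their coordinate-wise sum can be illustrated with a centralized sketch. Everything here is a simplification under assumptions: the function names are hypothetical, the empty subset is skipped, the subsets are ordered by size and then lexicographically, and the tree convergecast is replaced by a plain sum over all nodes of $\Gamma(S_i)$.

```python
from itertools import combinations

def ordered_subsets(component):
    """Non-empty subsets of a component in a fixed order that every node can reproduce."""
    members = sorted(component)
    return [frozenset(c)
            for r in range(1, len(members) + 1)
            for c in combinations(members, r)]

def membership_bits(adj, u, subsets, eps):
    """Bit j is 1 iff u belongs to K_{2 eps^2}(X_j), computed from u's local view."""
    thr = 2 * eps ** 2
    return [int(len(adj[u] & X) >= (1 - thr) * len(X)) for X in subsets]

def coordinatewise_sum(vectors):
    """What the root learns after the convergecast: |K_{2 eps^2}(X_j)| for every j."""
    return [sum(column) for column in zip(*vectors)]

# Complete graph on four nodes with sampled component S_i = {0, 1}.
adj = {v: set(range(4)) - {v} for v in range(4)}
component = {0, 1}
subsets = ordered_subsets(component)
neighborhood = set().union(*(adj[s] for s in component))  # Gamma(S_i)
vectors = [membership_bits(adj, u, subsets, 0.5) for u in sorted(neighborhood)]
print(coordinatewise_sum(vectors))
```

Each vector has one coordinate per subset, which is why the number of rounds in the distributed implementation grows with $2^{|S_i|}$ when messages are limited to $O(\log n)$ bits.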

When the decision stage of Algorithm DistNearClique starts, each connected component $S_i$ of $G[S]$ has
a "candidate" near-clique and we need to choose the largest $T_\epsilon(X)$ over all $X$'s. The difficulty is that there
may be more than one set that qualifies as a near-clique, and these sets may overlap. Just outputting the union
of these sets may be wrong because in general, the union of $\epsilon$-near cliques need not be an $\epsilon$-near clique. The
decision stage resolves this difficulty by allowing each node to "vote" only for the largest subset it is a member
of. This vote is implemented by killing all other subsets using "abort" messages, which are routed to the root of
the spanning tree constructed in the exploration stage. This ensures that from each collection of overlapping
sets, the largest one survives. Some small node sets may also have non-$\bot$ output: they can be disqualified if a
lower bound on the size of the dense subgraph is known.
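The voting rule of the decision stage can be sketched centrally as follows. The data layout is an assumption made for illustration: `candidates` maps the root ID of each component to its best candidate set $T_\epsilon(X(S_i))$, and `reach` maps the same root ID to $\Gamma(S_i)$, the nodes that receive the reported size. Nodes missing from the returned dictionary are the ones that output $\bot$.

```python
def decide(candidates, reach):
    """Simulate the acknowledge/abort votes; return the surviving labels."""
    aborted = set()
    voters = set().union(*reach.values())
    for v in voters:
        roots = [r for r in reach if v in reach[r]]
        # Acknowledge the largest reported candidate, ties broken by the largest root ID.
        best = max(roots, key=lambda r: (len(candidates[r]), r))
        aborted.update(r for r in roots if r != best)
    labels = {}
    for root, members in candidates.items():
        if root not in aborted:
            for v in members:
                labels[v] = root
    return labels

# Components 10 and 20 overlap at nodes 4 and 5; component 30 is isolated from both.
candidates = {10: {1, 2, 3, 4}, 20: {4, 5, 6}, 30: {8, 9}}
reach = {10: {1, 2, 3, 4, 5}, 20: {4, 5, 6, 7}, 30: {8, 9}}
print(decide(candidates, reach))
```

In the example, nodes 4 and 5 hear from both overlapping components and abort the smaller one, so only the candidates of roots 10 and 30 are output.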

4.1 Wrappers 

To conclude the description of the algorithm, we explain how to obtain a deterministic upper bound on the 
running time, and how to decrease error probability. 

• Bounding the running time. As we argue in Section 5.1, the time complexity of the algorithm can be bounded
with some constant probability. If a deterministic bound on the running time is desired, one can add a counter
at each node, and abort the algorithm if the running time exceeds the specified time limit.

• Boosting the success probability. The way to decrease the failure probability is not simply to run the
algorithm multiple times. Rather, only the sampling and exploration stages are run several times independently,
and then a single decision stage is applied to select the output. More specifically, say we want to achieve a success
probability of at least $1 - q$ for some given $q > 0$. Let $\lambda \stackrel{\mathrm{def}}{=} \log_{1-r} q$. To get failure probability at most $q$, we run
$\lambda$ independent versions of the sampling and exploration stages (in any interleaving order). These $\lambda$ versions are
run with a deterministic time bound as explained above. When all versions terminate, a single decision stage is
run, and in Step 3 of the decision stage, nodes consider candidates from all $\lambda$ versions, and choose (by sending
"acknowledge") only the largest of these candidates. This boosting wrapper increases the running time by a
factor of $\lambda$: the sampling and exploration stages are run $\lambda$ times, and the decision stage is slower by a factor of
$\lambda$ due to congestion on the links.
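The number of repetitions used by the boosting wrapper is easy to compute. The sketch below assumes that $r$ denotes the success probability of a single run of the sampling and exploration stages, which the text above does not spell out, and rounds the logarithm up to an integer. The function name `repetitions` is hypothetical.

```python
import math

def repetitions(r, q):
    """Smallest integer lam with (1 - r)**lam <= q, i.e. the ceiling of log_{1-r}(q)."""
    return math.ceil(math.log(q) / math.log(1 - r))

lam = repetitions(0.5, 0.01)
print(lam, (1 - 0.5) ** lam)
```

For a constant $r$ this gives $\lambda = O(\log(1/q))$, in line with the amplification cost stated in the introduction.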

5 Analysis 

In this section we sketch the analysis of Algorithm DistNearClique presented in Section 4. Full details (i.e.,
most proofs) are presented in the appendix.

5.1 Complexity 

We first state the time complexity in terms of the sample size, and then bound the sample size. 

Lemma 5.1 Let $S$ be the set of nodes sampled in the sampling stage of Algorithm DistNearClique. Then the
round complexity of the algorithm is at most $O(2^{|S|})$.

r, -. P n 

Lemma 5.2 Pr[|5| < 2pn] > 1 - e~~ . 

5.2 Correctness 

In this section we prove that Algorithm DistNearClique finds a large near-clique. We note that while the algorithm
appears similar to the $\rho$-clique algorithm in [10], the analysis of Algorithm DistNearClique is different.
We need to account for the fact that the input contains a near-clique (rather than a clique), and we need to
establish certain locality properties to show feasibility of a distributed implementation.

For the remainder of this section, fix $G = (V, E)$, $\epsilon > 0$, and $\delta > 0$. Let $|V| = n$. Assume that $D \subseteq V$ is
an $\epsilon^3$-near clique satisfying $|D| \ge \delta n$. Recall that $G[S]$ denotes the subgraph of $G$ induced by $S$. In addition,
assume that e < | (larger values are meaningless, see parameters of Theorem l5.7l ). 

Let D' denote the set of nodes output by Algorithm DistNearClique. Clearly, D' = T £ (X) for some X. We 
first show that every T e (X) is je-near clique where t = \T e (X)\. In the decision stage, the algorithm selects 
the largest T e (X). In Lemma [531 we prove our main technical result, namely that with constant probability, 
there exists a subset X* C Si with \T € (X*)\ > (1 - (e)) \D\. 

All large $T_\epsilon(X)$ are near-cliques. The following lemma proves that any $T_\epsilon(X)$ is a near-clique with a parameter
relating to its size.

Lemma 5.3 Let X C V, and denote t = \T e (X)\. Then T e (X) is -near clique. 

Existence of a large $T_\epsilon(X)$. We prove the existence of a connected set $X^* \subseteq S$ such that $T_\epsilon(X^*)$ is large.

First, let $C$ denote the set of all nodes in the $\epsilon^3$-near clique $D$ that are also adjacent to all but an $\epsilon^2$ fraction of
$D$. Formally: $C \stackrel{\mathrm{def}}{=} K_{\epsilon^2}(D) \cap D$, where $D$ is an $\epsilon^3$-near clique. We use the following simple property.
Lemma 5.4 |C| > (1 - e) \D\ - j,. 

Second, we structure the probability space defined by the sampling stage of Algorithm DistNearClique as
follows. In the algorithm, each node flips a coin with probability $p$ of getting "heads" (i.e., entering $S$). We
view this as a two-stage process, where each node flips two independent coins: $\mathrm{coin}_1$ with probability $p_1 \stackrel{\mathrm{def}}{=} p/2$
of getting "heads" and $\mathrm{coin}_2$ with probability $p_2 \stackrel{\mathrm{def}}{=} \frac{p - p_1}{1 - p_1} \ge p/2$ of getting "heads." A node enters $S$ iff at least
one of its coins turned out to be "heads." The idea is that the net result of the process is that each node enters $S$
independently with probability p, but this refinement allows us to define two subsets of S: let be the set of 
nodes for which coiri! is heads, and let be the set of nodes for which coin 2 is heads. 

Combining the notions, we define X* = n C, i.e., X* is a random variable representing the set of 
nodes from C for which coini is heads. X* is effectively a sample of C where each node is selected with 
probability p/2. We have the following. 
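The two-coin view can be checked with exact arithmetic. The sketch assumes that the second coin's bias is the value $p_2 = (p - p_1)/(1 - p_1)$ forced by the requirement that a node enters $S$ with probability exactly $p$; this formula is derived here from that requirement, and the function name `split_coin` is hypothetical.

```python
from fractions import Fraction

def split_coin(p):
    """Biases (p1, p2) of two independent coins whose OR comes up heads with probability p."""
    p1 = p / 2
    p2 = (p - p1) / (1 - p1)
    return p1, p2

p = Fraction(1, 10)
p1, p2 = split_coin(p)
prob_enter = 1 - (1 - p1) * (1 - p2)  # at least one coin shows heads
print(p1, p2, prob_enter)
```

For $p = 1/10$ this yields $p_1 = 1/20$ and $p_2 = 1/19 \ge p/2$, and the combined probability is again $1/10$.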

Lemma 5.5 $X^*$ resides within a single connected component of $G[S]$ with probability at least $1 - e^{-\Omega(\delta p n)}$.

We now arrive at our main lemma. 
Lemma 5.6 With probability at least 1 — ^e - ^ <5 pn ) over the selection of S, there exists a connected 
component Si ofG[S] and a set X* C Si s.t. \T t (X*) \ > (1 - ^e) \D\ - e~ 2 . 

Proof: Let $X^*$ be defined as above. It remains to show that $T_\epsilon(X^*)$ is large. Intuitively, $X^*$ is a random sample
of $C$, and since $C$ contains almost all of $D$, $X^*$ is also, in a sense, a sample of $D$. Thus $K_{2\epsilon^2}(X^*)$ should
be very close to $K_{(\cdot)}(C)$, $K_{(\cdot)}(D)$ for appropriately selected $(\cdot)$. This would complete the proof since $T_\epsilon(C)$
contains almost all of $C$ which, in turn, contains almost all of $D$. Formally, we say that $X^*$ is representative if
the following hold.

1. \KMD)\K 2t 2(X*)\<e\C\. 

2. \K 2e2 (X*)\K 3e2 (C)\<e 2 \C\. 

That is, if K 2e 2{X*) is almost fully contained in K e 2(D) and almost fully contains K 3e 2(C). 
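The two conditions can be checked mechanically on a small graph. The sketch below is illustrative only and rests on assumptions: it takes $K_\epsilon(X) = \{v : |\Gamma(v) \cap X| \ge (1-\epsilon)|X|\}$ and $T_\epsilon(X) = K_{2\epsilon^2}(X) \cap K_\epsilon(K_{2\epsilon^2}(X))$, as suggested by how the appendix proofs use these sets, it assumes $\Gamma(v)$ does not contain $v$ itself, and all function names are hypothetical.

```python
def k_set(adj, x, eps):
    """K_eps(X): nodes adjacent to all but an eps fraction of X (assumed definition)."""
    x = set(x)
    return {v for v in adj if len(adj[v] & x) >= (1 - eps) * len(x)}


def t_set(adj, x, eps):
    """T_eps(X) = K_{2 eps^2}(X) intersected with K_eps(K_{2 eps^2}(X)) (assumed definition)."""
    y = k_set(adj, x, 2 * eps ** 2)
    return y & k_set(adj, y, eps)


def is_representative(adj, c, d, x_star, eps):
    """Check items 1 and 2 of the definition of a representative set."""
    k_x = k_set(adj, x_star, 2 * eps ** 2)
    item1 = len(k_set(adj, d, eps ** 2) - k_x) < eps * len(c)
    item2 = len(k_x - k_set(adj, c, 3 * eps ** 2)) < eps ** 2 * len(c)
    return item1 and item2


# Example graph: a 4-clique {0, 1, 2, 3} plus a pendant node 4 attached to node 0.
adj = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0}}
clique = {0, 1, 2, 3}
```

On this toy graph with $\epsilon = 1/2$, sampling three of the four clique nodes already recovers the whole clique as $T_\epsilon(X)$ and excludes the pendant node.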

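To complete the proof, we use two claims presented below. Claim 2 shows that if $X^*$ is representative, then
$|C \setminus T_\epsilon(X^*)| < \frac{11}{2}\epsilon \cdot |C|$. Claim 3 shows that $X^*$ is representative except with probability exponentially small in $\epsilon^4 \delta pn$. Given
these claims, the proof is completed as follows. By Lemma 5.5, Claim 3, and the union bound, with the probability stated in the lemma, $X^*$ is representative and resides in a single connected component of $G[S]$. In that event, using Claim 2 and Lemma 5.4, the
proof is complete, because

$$|T_\epsilon(X^*)| \ge \left(1 - \tfrac{11}{2}\epsilon\right)|C| \ge \left(1 - \tfrac{11}{2}\epsilon\right)\left((1 - \epsilon)|D| - \epsilon^{-2}\right) \ge \left(1 - \tfrac{13}{2}\epsilon\right)|D| - \epsilon^{-2} . \qquad \Box$$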

Claim 2 If $X^*$ is representative, then $|C \setminus T_\epsilon(X^*)| < \frac{11}{2}\epsilon \cdot |C|$.
Claim 3 Pr [X* is representative] > 1 - ^ • e -^ 4 M 

5.3 Summary 

We summarize with the following theorem, which is the detailed version of Theorem 2.1 (in Theorem 2.1 we

Theorem 5.7 Let $G = (V, E)$, $|V| = n$. Let $D \subseteq V$ be an $\epsilon^3$-near clique in $G$ of size $|D| \ge \delta n$. Then with
probability at least 1 — i • e~ n ( € s ' pn ), Algorithm DistNearClique, running on G with parameters e,p, finds, 
in $O(2^{2pn})$ communication rounds, a subgraph $D'$ such that

(1) D' is ^ pzig^ ' -near clique^ 

(2) $|D'| \ge (1 - \frac{13}{2}\epsilon)|D| - \epsilon^{-2}$.

2 For small enough e, say e < this is at most 24.
Proof: By Lemmas 5.1 and 5.2, the probability that the round complexity exceeds $O(2^{2pn})$ is bounded by $e^{-\Omega(pn)}$.
By Lemma 5.3, whenever assertion (2) holds, assertion (1) holds as well. Assertion (2) holds by Lemma 5.6
with the probability stated there. The theorem follows from the union bound. □
It may also be interesting to analyze the computational complexity of the vertices running the algorithm.
A simple analysis shows that except for Step 4f of the exploration stage, the operation of each node can be
implemented in $\mathrm{poly}(|S|)$ computational steps (on $\log n$-bit numbers) per communication round. In Step 4f,
however, the nodes need to "inspect" all their neighbors in order to determine whether they reside in $T_\epsilon(X)$. It
is possible to reduce the complexity in this case by selecting a sample of the neighbors and estimating, rather
than determining, membership in $T_\epsilon(X)$. Thus, the computational complexity can be reduced to $\mathrm{poly}(|S|)$
computational steps per round (for our purposes, $|S| \le O(\log\log n)$). The analysis of this modification is
omitted.

6 Discussion 

On the impossibility of finding a globally maximal $\epsilon$-near clique. Our algorithm (when successful) finds a
disjoint collection of near-cliques such that at least one of them is large. We note that it is impossible for a
distributed sub-diameter time algorithm to output just one (say, the largest) clique. To see that, consider a graph
containing an $n/2$-vertex clique $A$ and an $n/4$-vertex clique $B$, connected by an $n/4$-long path $P$. The largest
near-clique in this case is obviously $A$, and the vertices of $B$ should output $\bot$. However, if we delete all edges
in $A$, the largest near-clique becomes $B$, i.e., its output must be non-$\bot$. Since no node in $B$ can distinguish
between the two scenarios in less than $|P| = n/4$ communication rounds, impossibility follows.
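The indistinguishability argument can be illustrated with a small simulation. The sketch below is not part of the paper's proof: it assumes that $n$ is divisible by 4, models what a node can learn in $r \ge 1$ rounds as the set of edges incident to nodes at distance less than $r$ from it, and uses hypothetical function names.

```python
from collections import deque


def build_graph(n, keep_a_edges):
    """Clique A on n/2 nodes and clique B on n/4 nodes, joined by a path P of n/4 nodes.

    If keep_a_edges is False, all edges inside A are deleted (second scenario).
    Returns the adjacency map and the list of nodes of B.
    """
    a = list(range(n // 2))
    p = list(range(n // 2, 3 * n // 4))
    b = list(range(3 * n // 4, n))
    adj = {v: set() for v in range(n)}

    def add_edge(u, v):
        adj[u].add(v)
        adj[v].add(u)

    def add_clique(nodes):
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                add_edge(u, v)

    if keep_a_edges:
        add_clique(a)
    add_clique(b)
    chain = [a[-1]] + p + [b[0]]
    for u, v in zip(chain, chain[1:]):
        add_edge(u, v)
    return adj, b


def local_view(adj, source, rounds):
    """Edges incident to nodes at distance < rounds from source (for rounds >= 1)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] + 1 >= rounds:
            continue  # neighbors of u would be at distance >= rounds
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return {frozenset((u, w)) for u in dist for w in adj[u]}
```

With $n = 16$, every node of $B$ has the same view in both scenarios after $|P| = 4$ rounds, while the views of the node of $B$ attached to the path differ once enough rounds have passed for information from $A$ to arrive.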
Deriving distributed algorithms from property testers. Our approach may raise hopes that other property
testers, at least in the dense graph model$^3$, can be adapted to the distributed setting. Goldreich and Trevisan
[11] prove that any property tester in the dense graph model has a canonical form where the first stage is
selecting a uniform sample of appropriate size from the graph and the second is testing the graph induced by the
sample for some (possibly other) property. Thus, the following scheme may seem likely to be useful:

1. Select a uniform sample by having all nodes flip a biased coin. 

2. Find the graph induced between sampled nodes. This graph has very small (possibly constant) size. 

3. Use some (possibly inefficient) distributed algorithm to test it for the required property. 
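The three steps above can be sketched centrally as follows. This is a simplified, non-distributed illustration with hypothetical function names; the clique check stands in for whatever property the canonical tester would examine on the sampled graph.

```python
import random
from itertools import combinations


def sample_nodes(nodes, p, rng):
    """Step 1: every node flips a coin with bias p and joins the sample on heads."""
    return {v for v in nodes if rng.random() < p}


def induced_subgraph(adj, sample):
    """Step 2: keep only sampled nodes and the edges between them."""
    return {v: adj[v] & sample for v in sample}


def is_clique(sub):
    """Step 3 (stand-in property): every pair of nodes in the subgraph is adjacent."""
    return all(v in sub[u] for u, v in combinations(sub, 2))


# Example graph: a triangle {0, 1, 2} with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

In the distributed setting the difficulty lies in Step 3: the sampled nodes must be able to communicate in order to run any test on the induced graph, which is where the connectivity issues discussed next arise.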

In the distributed setting, however, sometimes even testing a property for a very small graph is impossible
due to connectivity issues. As demonstrated above, there exist properties that are testable in the centralized
setting and do not admit a distributed algorithm with efficient round complexity. The general method above, therefore,
can only be applied in a "black-box" manner for some testers.

Specifically, the $\rho$-clique tester presented in [10] does not comply with the above requirements
(as we mentioned, the $\rho$-clique problem is unsolvable in small round complexity). It can, however, be converted
into a near-clique finder, in the sense defined in this work, using similar ideas and with worse parameters.



3 We note that the dense-graph model is, in many cases, inadequate for modeling communication networks, as such graphs are often
sparse (and thus a solution for an $\epsilon$-close graph is either trivial or uninteresting).

APPENDIX: Additional Proofs



Proof of Lemma 5.1: The sampling stage requires no communication. Consider now the exploration stage.
The BFS tree construction of Step 1 uses messages of $O(\log n)$ bits (each message contains an ID and a distance
counter), and its running time is proportional to the diameter of the component, which is trivially bounded by
$|S|$. The number of rounds to execute Step 2 is proportional to the number of IDs plus the height of the tree,
due to the pipelining of messages: the number of hops each ID needs to travel is at most twice the tree height,
and a message needs to wait at most once for each other ID. It follows that the total time required for this step
is $O(|S|)$ rounds. Step 3 takes at most $\max_i\{|S_i|\} \le |S|$ rounds. Step 4a requires no communication. Step
4b requires a node $u$ to send 0 or 1 for each subset of each component of $S$ it is adjacent to. Since there may
be at most $2^{|S|}$ such subsets (over all components), this step takes at most $O(2^{|S|})$ rounds. In Step 4c, each
entry of the vector may be a number between 0 and $n$, and hence the total number of bits in a vector is at most
$2^{|S|} \log n$; using pipelining once again, we can therefore bound the number of rounds required to execute Step
4c by $O(2^{|S|} + |S|) = O(2^{|S|})$. Similarly for Steps 4d and 4e. Step 4f is local. In the decision stage, Step 5 takes,
again, at most $O(2^{|S|})$ rounds. The remaining steps take at most $O(\max_i |S_i|) \le O(|S|)$ rounds. Thus the total
round complexity of Algorithm DistNearClique is at most $O(2^{|S|})$. □
Proof of Lemma 5.2: Follows from the Chernoff Bound, since in the sampling stage, each of the $n$ nodes joins
$S$ independently with probability $p$. □
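As a numerical sanity check, the exact binomial tail of $|S|$ can be compared with a Chernoff estimate. The sketch below is illustrative and makes two assumptions that are not taken from the text: that the event of interest is $|S| > 2pn$, and that the standard multiplicative bound $\Pr[|S| \ge 2pn] \le e^{-pn/3}$ is the form used. Function names are hypothetical.

```python
import math


def binomial_upper_tail(n, p, k):
    """Exact Pr[Bin(n, p) > k], computed term by term."""
    return sum(
        math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1, n + 1)
    )


def chernoff_double_mean_bound(n, p):
    """Assumed Chernoff bound exp(-pn/3) on the probability of exceeding twice the mean."""
    return math.exp(-p * n / 3.0)
```

For example, with $n = 200$ and $p = 0.05$ the mean sample size is 10, and the exact probability of seeing more than 20 sampled nodes lies below the estimate $e^{-10/3}$.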
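Proof of Lemma 5.3: By counting. Recall that each undirected edge is viewed and counted as two anti-symmetrical
directed edges. Define $Y = K_{2\epsilon^2}(X)$, and recall that $t = |T_\epsilon(X)|$. Consider a node $v \in T_\epsilon(X)$. By definition of $T_\epsilon(X)$,
$v \in K_\epsilon(Y)$, i.e.,

$$|\Gamma(v) \cap Y| \ge (1 - \epsilon)|Y| . \qquad (3)$$

Since $T_\epsilon(X) \subseteq Y$, we have $|\Gamma(v) \cap T_\epsilon(X)| \ge |\Gamma(v) \cap Y| - |Y \setminus T_\epsilon(X)| \ge t - \epsilon|Y|$ by Eq. (3). Since $|Y| \le n$, we can
conclude that $|\Gamma(v) \cap T_\epsilon(X)| \ge (1 - \frac{n}{t}\epsilon)t$. It follows that the total number of (directed) edges in $T_\epsilon(X)$ is at
least $(1 - \frac{n}{t}\epsilon)t(t - 1)$, as required. □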

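Proof of Lemma 5.4: Denote $c := |C|$ and $d := |D|$. Since $D$ is an $\epsilon^3$-near clique in $G$, we have that

$$|E \cap (D \times D)| \ge (1 - \epsilon^3)d(d - 1) \ge (1 - \epsilon^3)d^2 - d . \qquad (4)$$

By definition of $C$, if $v \in D \setminus C$, then

$$|E \cap (\{v\} \times D)| < (1 - \epsilon^2)d . \qquad (5)$$

Now, if we assume that $c < (1 - \epsilon)d - \epsilon^{-2}$, we arrive at a contradiction to Eq. (4), since

$$\begin{aligned}
|E \cap (D \times D)| &= |E \cap (C \times D)| + |E \cap ((D \setminus C) \times D)| \\
&\le c \cdot d + (d - c)(1 - \epsilon^2)d && \text{by Eq. (5)} \\
&= (1 - \epsilon^2) \cdot d^2 + \epsilon^2 \cdot c \cdot d \\
&< (1 - \epsilon^2) \cdot d^2 + \epsilon^2 \left((1 - \epsilon)d - \epsilon^{-2}\right) d && \text{if } c < (1 - \epsilon)d - \epsilon^{-2} \\
&= (1 - \epsilon^3)d^2 - d .
\end{aligned}$$

□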
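The algebra behind the contradiction can be checked with exact rational arithmetic. The sketch below is only a sanity check with hypothetical function names; it assumes the threshold $c = (1 - \epsilon)d - \epsilon^{-2}$, at which the upper bound on the number of directed edges coincides with the lower bound $(1 - \epsilon^3)d^2 - d$ of Eq. (4).

```python
from fractions import Fraction


def edge_upper_bound(eps, d, c):
    """Upper bound c*d + (d - c)*(1 - eps^2)*d on the directed edges inside D."""
    return c * d + (d - c) * (1 - eps ** 2) * d


def near_clique_lower_bound(eps, d):
    """Lower bound (1 - eps^3)*d^2 - d on the directed edges of an eps^3-near clique."""
    return (1 - eps ** 3) * d ** 2 - d


def threshold(eps, d):
    """Value of c at which the two bounds meet: (1 - eps)*d - eps^(-2)."""
    return (1 - eps) * d - 1 / eps ** 2
```

Because the upper bound is increasing in $c$, any $c$ strictly below the threshold pushes it strictly below the lower bound, which is the contradiction used in the proof.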






Proof of Lemma 5.5: We show that a stronger property holds with that probability: namely, that the distance
in $G[S]$ between any two nodes of $X^*$ is at most 2. By definition, $X^* \subseteq C$, i.e., $X^* \subseteq K_{\epsilon^2}(D)$. It follows from
the pigeonhole principle that every two nodes $u, v \in X^*$ have at least $(1 - 2\epsilon^2)|D|$ common neighbors. The
probability that none of these common neighbors is in $S^{(2)}$ (i.e., that none of them has outcome heads for $coin_2$)
is therefore at most $(1 - p_2)^{(1 - 2\epsilon^2)|D|} \le e^{-(1 - 2\epsilon^2)p_2|D|} \le e^{-\frac{7}{18}p|D|}$ for $\epsilon < 1/3$, and because $p_2 \ge p/2$ by
definition. We now apply the union bound over all pairs of nodes of $X^*$ to obtain that $\Pr[\mathrm{diameter}(X^*) > 2] \le |X^*|^2 \cdot e^{-\frac{7}{18}p|D|}$. Since $X^*$
is a random sample of $C$, it follows that $E[|X^*|] = p_1 \cdot |C| \le p\delta n$. Using a Chernoff bound, we obtain that
$\Pr[|X^*| > 2p\delta n] \le e^{-\Omega(\delta pn)}$. Therefore, by the union bound

$$\begin{aligned}
\Pr[\mathrm{diameter}(X^*) > 2] &\le \Pr\left[\mathrm{diameter}(X^*) > 2 \;\middle|\; |X^*| \le 2p\delta n\right] + \Pr\left[|X^*| > 2p\delta n\right] \\
&\le (2\delta pn)^2 \cdot e^{-\Omega(\delta pn)} + e^{-\Omega(\delta pn)} \le e^{-\Omega(\delta pn)} .
\end{aligned}$$

□
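The per-pair estimate used in this proof can be checked numerically. The sketch below is illustrative: it assumes $p_2 = p/(2-p)$ (the bias that makes the two-coin process equivalent to a single coin of bias $p$), takes the constant $7/18$ from $(1 - 2\epsilon^2)/2 \ge 7/18$ for $\epsilon \le 1/3$, and uses hypothetical function names.

```python
import math


def pair_failure_probability(p, eps, d):
    """(1 - p2)^((1 - 2 eps^2) d): no common neighbor of a fixed pair has coin 2 heads."""
    p2 = p / (2.0 - p)  # assumed bias of coin 2
    common_neighbors = (1 - 2 * eps ** 2) * d
    return (1 - p2) ** common_neighbors


def pair_failure_bound(p, d):
    """The estimate exp(-(7/18) p d), valid for eps <= 1/3."""
    return math.exp(-7.0 * p * d / 18.0)
```

The check relies only on $1 - x \le e^{-x}$, $p_2 \ge p/2$, and $1 - 2\epsilon^2 \ge 7/9$ for $\epsilon \le 1/3$.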

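Proof of Claim 2: By definition,

$$|C \setminus T_\epsilon(X^*)| \le |C \setminus K_{2\epsilon^2}(X^*)| + |C \setminus K_\epsilon(K_{2\epsilon^2}(X^*))| . \qquad (6)$$

We bound each term in Eq. (6) in turn.

First, note that $|K_{\epsilon^2}(D) \setminus K_{2\epsilon^2}(X^*)| < \epsilon|C|$, because $X^*$ is representative. It follows that

$$|C \setminus K_{2\epsilon^2}(X^*)| < \epsilon|C| , \qquad (7)$$

because $C \subseteq K_{\epsilon^2}(D)$. Note that Eq. (7) also implies for $\epsilon < 1/3$ that

$$|K_{2\epsilon^2}(X^*)| \ge (1 - \epsilon)|C| \ge \tfrac{2}{3}|C| . \qquad (8)$$

We now turn to the second term of Eq. (6). $X^*$ is representative, and therefore $|K_{2\epsilon^2}(X^*) \setminus K_{3\epsilon^2}(C)| <
\epsilon^2|C|$, i.e., all but $\epsilon^2|C|$ vertices of $K_{2\epsilon^2}(X^*)$ are neighbors of at least $(1 - 3\epsilon^2)|C|$ nodes of $C$. Let $Y =
C \setminus K_\epsilon(K_{2\epsilon^2}(X^*))$, $y = |Y|$, and $z = |K_{2\epsilon^2}(X^*)|$. Counting the number of edges between $C$ and $K_{2\epsilon^2}(X^*)$
we conclude that $y \cdot (1 - \epsilon)z + (|C| - y) \cdot z \ge (z - \epsilon^2|C|)(1 - 3\epsilon^2)|C|$, and plugging in Eq. (8) we obtain

$$y \cdot (1 - \epsilon)z + (|C| - y) \cdot z \ge z \cdot \left(1 - \tfrac{3\epsilon^2}{2}\right)(1 - 3\epsilon^2)|C| \ge \left(1 - \tfrac{9\epsilon^2}{2}\right) \cdot z|C| .$$

The left-hand side equals $z|C| - \epsilon yz$, so rearranging, we have $y \le \frac{9\epsilon}{2} \cdot |C|$, and the claim follows. □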
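Proof of Claim 3: Since $E[|X^*|] = p_1|C|$, and since membership in $X^*$ is determined independently for each
node, we can apply the Chernoff Bound to obtain that

$$\Pr\left[\,\Big||X^*| - E[|X^*|]\Big| > \tfrac{\epsilon^2}{4}E[|X^*|]\right] \le 2\exp\left(-\tfrac{1}{3}\left(\tfrac{\epsilon^2}{4}\right)^2 E[|X^*|]\right) = 2\exp\left(-\frac{\epsilon^4 p_1|C|}{48}\right) .$$

Assume that $\Big||X^*| - E[|X^*|]\Big| \le \tfrac{\epsilon^2}{4}E[|X^*|]$, and let us consider the definition of a representative set.

For item 1, let $v \in K_{\epsilon^2}(D)$. Then $|\Gamma(v) \cap C| \ge |C| - \epsilon^2|D| \ge |C| - \frac{3}{2}\epsilon^2|C| = (1 - \frac{3}{2}\epsilon^2)|C|$. Since
$\Gamma(v) \cap X^*$ is a random sample of $\Gamma(v) \cap C$, where each member is chosen with probability $p_1$, we have that
$E[|\Gamma(v) \cap X^*|] = p_1 \cdot |\Gamma(v) \cap C| \ge (1 - \frac{3}{2}\epsilon^2)p_1|C| = (1 - \frac{3}{2}\epsilon^2)E[|X^*|]$. Denote $Y_v = |\Gamma(v) \cap X^*|$. Then

$$\begin{aligned}
\Pr[v \notin K_{2\epsilon^2}(X^*)] &= \Pr\left[Y_v < (1 - 2\epsilon^2)|X^*|\right] \le \Pr\left[Y_v < (1 - 2\epsilon^2)\left(1 + \tfrac{\epsilon^2}{4}\right)E[|X^*|]\right] \\
&\le \Pr\left[Y_v - E[Y_v] < (1 - 2\epsilon^2)\left(1 + \tfrac{\epsilon^2}{4}\right)E[|X^*|] - \left(1 - \tfrac{3}{2}\epsilon^2\right)E[|X^*|]\right] \\
&\le \Pr\left[Y_v - E[Y_v] < -\tfrac{\epsilon^2}{4}E[|X^*|]\right] \\
&\le \exp\left(-\frac{1}{2} \cdot \frac{\left(\frac{\epsilon^2}{4}E[|X^*|]\right)^2}{E[Y_v]}\right) \\
&\le \exp\left(-\frac{\epsilon^4}{32}E[|X^*|]\right) = \exp\left(-\frac{\epsilon^4}{32} \cdot p_1|C|\right) ,
\end{aligned}$$

and therefore $E[|K_{\epsilon^2}(D) \setminus K_{2\epsilon^2}(X^*)|] \le \exp\left(-\frac{\epsilon^4}{32}p_1|C|\right) \cdot n$. Using Markov's Inequality we obtain

$$\Pr\left[|K_{\epsilon^2}(D) \setminus K_{2\epsilon^2}(X^*)| \ge \epsilon|C|\right] \le \frac{n}{\epsilon|C|} \cdot e^{-\epsilon^4 p_1|C|/32} .$$

A similar argument applies to item 2. Consider a node $v \notin K_{3\epsilon^2}(C)$. Denote $Y_v = |\Gamma(v) \cap X^*|$. Then
$E[Y_v] < (1 - 3\epsilon^2)E[|X^*|]$, and therefore

$$\begin{aligned}
\Pr[v \in K_{2\epsilon^2}(X^*)] &= \Pr\left[Y_v \ge (1 - 2\epsilon^2)|X^*|\right] \le \Pr\left[Y_v \ge (1 - 2\epsilon^2)\left(1 - \tfrac{\epsilon^2}{4}\right)E[|X^*|]\right] \\
&\le \Pr\left[Y_v - E[Y_v] \ge (1 - 2\epsilon^2)\left(1 - \tfrac{\epsilon^2}{4}\right)E[|X^*|] - (1 - 3\epsilon^2)E[|X^*|]\right] \\
&\le \Pr\left[Y_v - E[Y_v] \ge \tfrac{3\epsilon^2}{4}E[|X^*|]\right] \\
&\le \exp\left(-\frac{1}{3}\left(\frac{\frac{3\epsilon^2}{4}E[|X^*|]}{E[Y_v]}\right)^2 E[Y_v]\right) \\
&\le \exp\left(-\frac{3\epsilon^4}{16}E[|X^*|]\right) = \exp\left(-\frac{3\epsilon^4}{16} \cdot p_1|C|\right) ,
\end{aligned}$$

i.e., $E[|K_{2\epsilon^2}(X^*) \setminus K_{3\epsilon^2}(C)|] \le \exp\left(-\frac{3}{16}\epsilon^4 \cdot p_1|C|\right) \cdot n$, which implies, as above, that

$$\Pr\left[|K_{2\epsilon^2}(X^*) \setminus K_{3\epsilon^2}(C)| \ge \epsilon^2|C|\right] \le \frac{n}{\epsilon^2|C|} \cdot e^{-3\epsilon^4 p_1|C|/16} .$$

Finally, we apply the Union Bound to obtain that $X^*$ is representative with probability at least

$$1 - 2e^{-\epsilon^4 p_1|C|/48} - \frac{n}{\epsilon|C|} \cdot e^{-\epsilon^4 p_1|C|/32} - \frac{n}{\epsilon^2|C|} \cdot e^{-3\epsilon^4 p_1|C|/16} ,$$

which gives the bound of the claim. □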